Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.
The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.
library(stringr)
library(dplyr)
library(quanteda)
library(readtext)
library(R.utils)
library(ggplot2)
set.seed(3301)
Tasks to accomplish
Questions to consider
Download data.
source("downloadData.R")
attach(downloadData(file.path("..", "data")))
c(blogs, twitter, news, badwords)
## [1] "../data/final/en_US/en_US.blogs.txt"
## [2] "../data/final/en_US/en_US.twitter.txt"
## [3] "../data/final/en_US/en_US.news.txt"
## [4] "../data/bad-words.txt"
First, try processing the entire file with a from-scratch implementation:
tweets <- 0
wordsTwitter <- 0
sentencesTwitter <- 0
con <- file(twitter, "r")
while (length(oneLine <- readLines(con, 1, warn = FALSE)) > 0) {
# Count tweets
tweets <- tweets + 1
# Show the first 10 tweets
if(tweets <= 10) {
print(oneLine)
}
# Tokenize on whitespace
words <- str_split(oneLine, "\\s+")[[1]]
# To flag symbol-only tokens like ':)', initialise a logical vector
symbols <- rep(FALSE, length = length(words))
# For each token:
for(i in 1:length(words)) {
# Mark tokens consisting only of non-alphanumeric characters
symbols[i] <- grepl("^[^a-zA-Z0-9]+$", words[i])
# Collapse purely numeric tokens into the placeholder '[numbers]'
if(grepl("^[0-9]+$", words[i])) {
words[i] <- "[numbers]"
}
}
# Tokens
wordsPerLine <- length(simpleWords <- words[!symbols])
# Count tokens ending with punctuation as the number of sentences
for(i in 1:length(simpleWords)){
if(grepl("[.!?]$", simpleWords[i])) {
sentencesTwitter <- sentencesTwitter + 1
}
}
wordsTwitter <- wordsTwitter + wordsPerLine
remove(simpleWords, words)
}
close(con)
tweets
wordsTwitter
sentencesTwitter
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [1] "they've decided its more fun if I don't."
## [1] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"
## [1] "Words from a complete stranger! Made my birthday even better :)"
## [1] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"
## [1] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"
## [1] "I'm coo... Jus at work hella tired r u ever in cali"
## [1] "The new sundrop commercial ...hehe love at first sight"
## [1] "we need to reconnect THIS WEEK"
## [1] 2360148
## [1] 29706404
## [1] 2818583
This takes a long time to compute, so re-implement it with the quanteda package.
Load the file with the readtext package.
tweetFile <- readtext(twitter)
corpusTwitter <- corpus(tweetFile, cache = FALSE)
summary(corpusTwitter)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_US.twitter.txt 566951 36719658 2588551
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 11:44:51 2018
## Notes:
Tips, tricks, and hints
Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data. Reading in chunks or lines using R’s readLines or scan functions can be useful. You can also loop over each line of text by embedding readLines within a for/while loop, but this may be slower than reading in large chunks at a time. Reading pieces of the file at a time will require the use of a file connection in R. For example, the following code could be used to read the first few lines of the English Twitter dataset:
con <- file("en_US.twitter.txt", "r")
readLines(con, 1) ## Read the first line of text
readLines(con, 1) ## Read the next line of text
readLines(con, 5) ## Read in the next 5 lines of text
close(con) ## It's important to close the connection when you are done
See the ?connections help page for more information.
Sampling. To reiterate, to build models you don’t need to load in and use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data. Remember your inference class and how a representative sample can be used to infer facts about a population. You might want to create a separate sub-sample dataset by reading in a random subset of the original data and writing it out to a separate file. That way, you can store the sample and not have to recreate it every time. You can use the rbinom function to “flip a biased coin” to determine whether you sample a line of text or not.
Sub-Sampling.
tweets <- as.numeric(countLines(twitter))
twitterSubSampling <- paste0(twitter, ".sub-sampling.txt")
if(!file.exists(twitterSubSampling)) {
subSamplingRate <- .01
flipABiasedCoin <- rbinom(tweets, size = 1, prob = subSamplingRate)
conRead <- file(twitter, "r")
conWrite <- file(twitterSubSampling, "w")
len <- 0
while (length(oneLine <- readLines(conRead, 1, warn = FALSE)) > 0) {
len <- len + 1
if(flipABiasedCoin[len] == 1) {
writeLines(oneLine, conWrite)
}
}
close(conRead)
close(conWrite)
}
subTweets <- as.numeric(countLines(twitterSubSampling))
subTweets
## [1] 23662
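As a back-of-the-envelope sanity check (reusing the tweet count of 2,360,148 reported by the full-file scan above), the sample size should be close to its binomial expectation n * p, with standard deviation sqrt(n * p * (1 - p)):

```r
# Expected size of a 1% binomial sub-sample of the full Twitter file
n <- 2360148   # total tweets (from the full-file scan above)
p <- 0.01      # sub-sampling rate
expected <- n * p
stdev <- sqrt(n * p * (1 - p))
round(c(expected = expected, sd = stdev))
```

The observed 23,662 sampled tweets sit well within one standard deviation of this expectation.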
Tokenization.
subTweetFile <- readtext(twitterSubSampling)
subTwitterCorpus <- corpus(subTweetFile, cache = FALSE)
summary(subTwitterCorpus)
## Corpus consisting of 1 document:
##
## Text Types Tokens Sentences
## en_US.twitter.txt.sub-sampling.txt 34531 368276 26039
##
## Source: /Users/warhol/Documents/!work/Data-Science-Capstone/MilestoneReport/* on x86_64 by warhol
## Created: Sun Jul 29 17:19:37 2018
## Notes:
Load bad words.
profanity <- readLines(badwords)
| Field | Unit | Sample sequence | 1-gram sequence | 2-gram sequence | 3-gram sequence |
|---|---|---|---|---|---|
| Computational linguistics | word | … to be or not to be … | …, to, be, or, not, to, be, … | …, to be, be or, or not, not to, to be, … | …, to be or, be or not, or not to, not to be, … |
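The sliding-window construction in the table can be sketched in a few lines of base R (quanteda's tokens_ngrams does the same job, joining words with "_" by default):

```r
# Build n-grams by sliding a window of length n over the token sequence
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
tokens <- c("to", "be", "or", "not", "to", "be")
ngrams(tokens, 2)
# [1] "to be"  "be or"  "or not" "not to" "to be"
```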
Top 20.
subTweetsDfm <- subTwitterCorpus %>%
# normalize tokens: drop punctuation and numbers
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove profanity (stopwords kept for this pass)
# tokens_remove(stopwords('english')) %>%
tokens_remove(profanity)
topfeatures(dfm(subTweetsDfm), 20)
## the to i a you and for in of is it my on that me
## 9228 7770 7260 6194 5443 4470 3890 3725 3658 3475 3074 2988 2741 2292 2100
## be at with have your
## 1857 1816 1728 1712 1711
Plot word cloud.
dfm(subTweetsDfm) %>%
# dfm_trim(min_termfreq = 10,
# verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Normalize words (this time also removing English stopwords).
subTweetsDfmNomarized <- subTwitterCorpus %>%
# normalize tokens: drop punctuation and numbers
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove stopwords and profanity
tokens_remove(stopwords('english')) %>%
tokens_remove(profanity)
Top 20 normalized words.
topfeatures(dfm(subTweetsDfmNomarized), 20)
## just like get love good thanks day can now rt
## 1528 1218 1182 1060 1058 918 917 905 871 862
## one great know new time u today go see lol
## 803 782 780 762 757 741 741 694 682 673
Plot word cloud.
dfm(subTweetsDfmNomarized) %>%
# dfm_trim(min_termfreq = 10,
# verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm <- textstat_frequency(dfm(subTweetsDfmNomarized), n = 80)
# Sort by reverse frequency order
featuresTweetsDfm$feature <- with(featuresTweetsDfm, reorder(feature, -frequency))
ggplot(featuresTweetsDfm, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
2-Gram
subTweetsDfmNomarized2Gram <- subTwitterCorpus %>%
# normalize tokens: drop punctuation and numbers
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove profanity (stopwords kept for this pass)
# tokens_remove(stopwords('english')) %>%
tokens_remove(profanity) %>%
tokens_ngrams(n = 2)
topfeatures(dfm(subTweetsDfmNomarized2Gram), 20)
## in_the for_the of_the on_the to_be thanks_for
## 777 714 592 466 453 446
## to_the i_love thank_you at_the if_you i_have
## 430 374 363 360 344 317
## i_am for_a have_a going_to to_see will_be
## 316 314 313 303 285 277
## is_a to_get
## 269 263
dfm(subTweetsDfmNomarized2Gram) %>%
# dfm_trim(min_termfreq = 10,
# verbose = FALSE) %>%
textplot_wordcloud(min_count = 6,
random_order = FALSE,
max_words = 100,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm2Gram <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram), n = 80)
# Sort by reverse frequency order
featuresTweetsDfm2Gram$feature <- with(featuresTweetsDfm2Gram, reorder(feature, -frequency))
ggplot(featuresTweetsDfm2Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
3-Gram
subTweetsDfmNomarized3Gram <- subTwitterCorpus %>%
# normalize tokens: drop punctuation and numbers
tokens(remove_punct = TRUE,
remove_numbers = TRUE) %>%
# remove profanity (stopwords kept for this pass)
# tokens_remove(stopwords('english')) %>%
tokens_remove(profanity) %>%
tokens_ngrams(n = 3)
topfeatures(dfm(subTweetsDfmNomarized3Gram), 20)
## thanks_for_the thank_you_for looking_forward_to
## 231 91 90
## i_love_you can't_wait_to for_the_follow
## 88 82 76
## i_want_to one_of_the going_to_be
## 72 65 61
## to_see_you have_a_great a_lot_of
## 61 59 58
## i_need_to i_have_to i'm_going_to
## 53 51 49
## is_going_to you_want_to thanks_for_following
## 47 46 45
## how_are_you let_me_know
## 44 44
dfm(subTweetsDfmNomarized3Gram) %>%
textplot_wordcloud(#min_count = 4,
random_order = FALSE,
max_words = 50,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
Frequency Plots
featuresTweetsDfm3Gram <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram), 60)
# Sort by reverse frequency order
featuresTweetsDfm3Gram$feature <- with(featuresTweetsDfm3Gram, reorder(feature, -frequency))
ggplot(featuresTweetsDfm3Gram, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
1-gram 90th percentile:
featuresTweetsDfmFull <- textstat_frequency(dfm(subTweetsDfmNomarized))
summary(featuresTweetsDfmFull)
## feature frequency rank docfreq
## Length:25808 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 6453 1st Qu.:1
## Mode :character Median : 1.000 Median :12904 Median :1
## Mean : 6.404 Mean :12904 Mean :1
## 3rd Qu.: 3.000 3rd Qu.:19356 3rd Qu.:1
## Max. :1528.000 Max. :25808 Max. :1
## group
## Length:25808
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfmFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 9 1528
2-gram 90th percentile:
featuresTweetsDfm2GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized2Gram))
summary(featuresTweetsDfm2GramFull)
## feature frequency rank docfreq
## Length:161608 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 40403 1st Qu.:1
## Mode :character Median : 1.000 Median : 80804 Median :1
## Mean : 1.808 Mean : 80804 Mean :1
## 3rd Qu.: 1.000 3rd Qu.:121206 3rd Qu.:1
## Max. :777.000 Max. :161608 Max. :1
## group
## Length:161608
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfm2GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 2 777
3-gram 90th percentile:
featuresTweetsDfm3GramFull <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram))
summary(featuresTweetsDfm3GramFull)
## feature frequency rank docfreq
## Length:260218 Min. : 1.000 Min. : 1 Min. :1
## Class :character 1st Qu.: 1.000 1st Qu.: 65055 1st Qu.:1
## Mode :character Median : 1.000 Median :130110 Median :1
## Mean : 1.123 Mean :130110 Mean :1
## 3rd Qu.: 1.000 3rd Qu.:195164 3rd Qu.:1
## Max. :231.000 Max. :260218 Max. :1
## group
## Length:260218
## Class :character
## Mode :character
##
##
##
quantile(featuresTweetsDfm3GramFull$frequency, c(0, .1, .5, .9, 1))
## 0% 10% 50% 90% 100%
## 1 1 1 1 231
ntoken(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt
## 165278
ntype(subTweetsDfmNomarized)
## en_US.twitter.txt.sub-sampling.txt
## 32487
The frequency distribution appears to follow Zipf’s law: a handful of types account for most tokens, while the large majority of types occur only once.
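A quick way to eyeball Zipf’s law (a sketch on idealized synthetic counts, not the corpus itself): under Zipf, log frequency is approximately linear in log rank with slope near -1.

```r
# Idealized Zipfian counts: frequency proportional to 1 / rank
rank <- 1:1000
freq <- 10000 / rank
fit  <- lsfit(log(rank), log(freq))
fit$coef["X"]   # slope; exactly -1 for ideal Zipf data
```

Fitting the same regression to the rank and frequency columns of featuresTweetsDfmFull would show how closely the sampled tweets track this idealized line.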
Basic 2-gram model:
nextWords2Gram <- function(input, outputs = 5, k = 0) {
# k is the least important of the parameters. It is usually chosen to be 0.
# However, empirical testing may find better values for k.
featuresNextWord <- NULL
# extract n-gram that starts with input
nextWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
phrase(paste0(
input, "_*"
))))
if (length(nextWordDfm) > k) {
# top n frequency stat
featuresNextWord <-
textstat_frequency(nextWordDfm, n = outputs)
# human readable outputs
featuresNextWord$feature <-
sapply(as.vector(featuresNextWord$feature),
function(x) {
str_split(x, "_")[[1]][2]
})
# Sort by reverse frequency order
featuresNextWord$feature <-
with(featuresNextWord,
reorder(feature,-frequency))
}
featuresNextWord
}
Next word of Looking is:
ggplot(nextWords2Gram("Looking"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
Next word of forward is:
ggplot(nextWords2Gram("forward"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went
ggplot(nextWords2Gram("went"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went to
ggplot(nextWords2Gram("to"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went to be
ggplot(nextWords2Gram("be"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went to be a
ggplot(nextWords2Gram("a"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went to be a great
ggplot(nextWords2Gram("great"), aes(x = feature, y = frequency)) +
geom_bar(stat = "identity") +
xlab("Next word")
I went to be a great day …
simpleGoodTuring <- function(r, Nr) {
# number of words
N <- sum(r * Nr)
d <- diff(r)
## Turing estimate
# turing estimate index
ti <- which(d == 1)
# discount coefficients of Turing estimate
dct <- numeric(length(r))
dct[ti] <- (r[ti] + 1) / r[ti] * c(Nr[-1], 0)[ti] / Nr[ti]
## Linear Good-Turing estimate
Zr <- Nr / c(1, 0.5 * (d[-1] + d[-length(d)]), d[length(d)])
f <- lsfit(log(r), log(Zr))
coef <- f$coef
# corrected term frequency
rc <- r * (1 + 1 / r)^(1 + coef[2])
# discount coefficients of Linear Good-Turing estimate
dclgt <- rc / r
## make switch from Turing to LGT estimates
# approximate standard deviation of the Turing estimate
rsd <- rep(1,length(r))
rsd[ti] <- (r[ti] + 1) / Nr[ti] * sqrt(Nr[ti + 1] * (1 + Nr[ti + 1] / Nr[ti]))
dc <- dct
for (i in 1:length(r)) {
if (abs(dct[i] - dclgt[i]) * r[i] / rsd[i] <= 1.65) {
dc[i:length(dc)] <- dclgt[i:length(dc)]
break
}
}
## renormalize the probabilities for observed objects
# summation of probabilities
sump <- sum(dc * r * Nr) / N
# renormalized discount coefficients
dcr <- (1 - Nr[1] / N) * dc / sump
# term frequency
tf <- c(Nr[1] / N, r * dcr)
p <- c(Nr[1] / N, r * dcr / N)
names(p) <- names(tf) <- c(0, r)
list(p = p, r = tf)
}
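To see what the function starts from, here is the plain Turing estimate r* = (r + 1) * N_{r+1} / N_r on a small frequency-of-frequencies table (the counts below are made up purely for illustration):

```r
# Toy frequency-of-frequencies table (hypothetical counts)
r  <- 1:5                      # observed trigram frequencies
Nr <- c(120, 40, 24, 13, 15)   # number of trigrams seen exactly r times
# Plain Turing estimate: r* = (r + 1) * N_{r+1} / N_r (0 past the table's end)
r_star <- (r + 1) * c(Nr[-1], 0) / Nr
round(r_star, 2)
```

The smoothing inside simpleGoodTuring exists precisely because, as for r = 4 here, raw Turing estimates turn noisy (even non-monotone) once N_r gets small.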
NrTbl <- textstat_frequency(dfm(subTweetsDfmNomarized3Gram)) %>%
select(frequency) %>%
mutate(freqOfFrequency = 1) %>%
group_by(frequency) %>%
summarise_all(sum)
Nr <- NrTbl$freqOfFrequency
r <- NrTbl$frequency
sgt <- simpleGoodTuring(r, Nr)
dTrigram <- function(freq) {
sgt$r[as.character(freq)] / freq
}
nextWords <- function(input, ngram = 3, outputs = 3, k = 0) {
# k is the least important of the parameters. It is usually chosen to be 0.
# However, empirical testing may find better values for k.
inputs <- str_split(input, "\\s+")[[1]]
inputsSize <- length(inputs)
if (inputsSize < ngram - 1) { return() }
triGram <- paste(inputs[inputsSize - 1],
inputs[inputsSize],
sep = "_")
biGram <- inputs[inputsSize]
featuresNextWord <- NULL
# extract n-gram that starts with input
nextWordDfm <- dfm(tokens_select(subTweetsDfmNomarized3Gram,
phrase(paste0(
triGram, "_*"
))))
if (length(nextWordDfm) > k) {
prevWordDfm <- dfm(tokens_select(subTweetsDfmNomarized2Gram,
phrase(triGram)))
prevWordFreq <- textstat_frequency(prevWordDfm)$frequency
# data frame
featuresNextWord <-
textstat_frequency(nextWordDfm) %>%
mutate(p_bo = dTrigram(frequency) * frequency / prevWordFreq)
# human readable outputs
featuresNextWord$feature <-
sapply(as.vector(featuresNextWord$feature),
function(x) {
str_split(x, "_")[[1]][3]
})
# Sort by reverse frequency order
featuresNextWord$feature <-
with(featuresNextWord,
reorder(feature,-p_bo))
} else {
# TODO: back off to the 2-gram model (with back-off weight beta) when no 3-gram matches
}
featuresNextWord %>% slice(1:5)
}
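A hedged numeric sketch of the Katz-style estimate computed inside nextWords: for a seen trigram, P_bo(w3 | w1 w2) = d_r * C(w1 w2 w3) / C(w1 w2). Using the counts reported in the n-gram tables above for "thanks for the" (231) and "thanks for" (446), and a hypothetical discount coefficient of 0.95:

```r
# Discounted maximum-likelihood estimate for a seen trigram (Katz back-off)
trigram_count <- 231    # C("thanks for the"), from the 3-gram table above
bigram_count  <- 446    # C("thanks for"), from the 2-gram table above
d <- 0.95               # hypothetical Good-Turing discount coefficient
p_bo <- d * trigram_count / bigram_count
round(p_bo, 3)
```

In the function above, d comes from dTrigram(), i.e. the smoothed count from simpleGoodTuring divided by the raw count, rather than a fixed 0.95.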
I went to be
ggplot(nextWords("I went to be"), aes(x = feature, y = p_bo)) +
geom_bar(stat = "identity") +
xlab("Next word") + ylab("P_bo")